Abstract

In this paper, we consider unsupervised partitioning problems, such as clus- 
tering, image segmentation, video segmentation and other change-point detection 
problems. We focus on partitioning problems based explicitly or implicitly on the 
minimization of Euclidean distortions, which include mean-based change-point 
detection, K-means, spectral clustering and normalized cuts. Our main goal is 
to learn a Mahalanobis metric for these unsupervised problems, leading to fea- 
ture weighting and/or selection. This is done in a supervised way by assuming 
the availability of several potentially partially labelled datasets that share the same 
metric. We cast the metric learning problem as a large-margin structured predic- 
tion problem, with proper definition of regularizers and losses, leading to a convex 
optimization problem which can be solved efficiently with iterative techniques. 
We provide experiments where we show how learning the metric may significantly 
improve the partitioning performance in synthetic examples, bioinformatics, video 
segmentation and image segmentation problems. 

1 Introduction 

Unsupervised partitioning problems are ubiquitous in machine learning and other
data-oriented fields such as computer vision, bioinformatics or signal processing. They
include (a) traditional unsupervised clustering problems, with the classical K-means
algorithm, hierarchical linkage methods [14] and spectral clustering [22]; (b) unsupervised
image segmentation problems where two neighboring pixels are encouraged
to be in the same cluster, with mean-shift techniques [9] or normalized cuts [25];
and (c) change-point detection problems adapted to multivariate sequences (such as
video) where segments are composed of contiguous elements, with typical window-based
algorithms [11] and various methods looking for a change in the mean of the
features (see, e.g., [8]).

All the algorithms mentioned above rely on a specific distance (or more generally a
similarity measure) on the space of configurations. A good metric is crucial to the
performance of these partitioning algorithms and its choice is heavily problem-dependent.
While the choice of such a metric was originally tackled manually (often by trial
and error), recent work has considered learning such a metric directly from data. Without
any supervision, the problem is ill-posed and methods based on generative models may
learn a metric or reduce dimensionality (see, e.g., [10]), but typically with no guarantees
that they lead to better partitions. In this paper, we follow [4, 32, 3] and consider
the goal of learning a metric for potentially several partitioning problems sharing the
same metric, assuming that several fully or partially labelled partitioned datasets are
available during the learning phase. While such labelled datasets are typically expensive
to produce, there are several scenarios where these datasets have already been
built, often for evaluation purposes. These occur in video segmentation tasks (see
Section 6.1), image segmentation tasks (see Section 6.3), as well as change-point detection
tasks in bioinformatics (see [15] and Section 5.3).

In this paper, we consider partitioning problems based explicitly or implicitly on 
the minimization of Euclidean distortions, which include K-means, spectral clustering 
and normalized cuts, and mean-based change-point detection. We make the following 
contributions: 

- We review and unify several partitioning algorithms in Section 2 and cast them as
the maximization of a linear function of a rescaled equivalence matrix, which can
be solved by algorithms based on spectral relaxations or dynamic programming.

- Given fully labelled datasets, we cast in Section 4 the metric learning problem as a
large-margin structured prediction problem, with proper definition of regularizers,
losses and efficient loss-augmented inference.

- Given partially labelled datasets, we propose in Section 5 an algorithm, iterating
between labelling the full datasets given a metric and learning a metric given the
fully labelled datasets. We also consider in Section 5.3 extensions that allow changes
in the full distribution of univariate time series (rather than changes only in the
mean), with application to bioinformatics.

We provide in Section 6 experiments where we show how learning the metric may
significantly improve the partitioning performance in synthetic examples, video
segmentation and image segmentation problems.



Related work. 

The need for metric learning goes far beyond unsupervised partitioning problems.
[30] proposed a large-margin framework for learning a metric in nearest-neighbour
algorithms based on sets of must-link/must-not-link constraints, while [13] considers
a probability-based non-convex formulation. For these works, a single dataset is fully
labelled and the goal is to learn a metric leading to good testing performance on unseen
data.

Some recent work [17] proved links between metric learning and kernel learning,
making it possible to kernelize any Mahalanobis distance learning problem.

Metric learning has also been considered in semi-supervised clustering of a single
dataset, where some partial constraints are given. This includes the works of [4, 32],
both based on efficient convex formulations. As shown in Section 6, these can be used
in our settings as well by stacking several datasets into a single one. However, our
discriminative large-margin approach outperforms these.

Moreover, the task of learning how to partition was tackled in [3] for spectral
clustering. The problem set-up is the same (availability of several fully partitioned
datasets); however, the formulation is non-convex and relies on the unstable optimization
of eigenvectors. In Section 5.1, we propose a convex, more stable large-margin
approach.

Other approaches do not require any supervision [10], and perform dimensionality
reduction and clustering at the same time, by iteratively alternating the computation of a
low-rank matrix and a clustering of the data using the corresponding metric. However,
they are unable to take advantage of the labelled information that we use.

Our approach can also be related to the one of [26]. Given a small set of labelled
instances, they use a similar large-margin framework, inspired by [29], to learn
parameters of Markov random fields, using graph cuts for solving the "loss-augmented
inference problem" of structured prediction. However, their segmentation framework
does not apply to unsupervised segmentation (which is the goal of this paper). In this
paper, we present a supervised learning framework aiming at learning how to perform
an unsupervised task.

Our approach to learning the metric is nevertheless slightly different from the ones
mentioned above. Indeed, we cast this problem as the solution of a structured SVM as in
[29, 27]. As a result, our paper shares many conceptual steps with works such as [7, 21],
which use a structured SVM to learn weights for graph matching in one case and
a metric for ranking in the other.

2 Partitioning through matrix factorization 

In this section, we consider T multi-dimensional observations $x_1, \dots, x_T \in \mathbb{R}^P$, which
may be represented in a matrix $X \in \mathbb{R}^{T \times P}$. Partitioning the T observations into K
classes is equivalent to finding an assignment matrix $Y \in \{0, 1\}^{T \times K}$, such that $Y_{ij} = 1$
if the i-th observation is assigned to cluster j and $Y_{ij} = 0$ otherwise. For general partitioning
problems, no additional constraints are used, but for change-point detection problems,
it is assumed that the segments are contiguous and with increasing labels. That is, the
matrix Y is of the form

$$Y = \begin{pmatrix} 1_{T_1} & & 0 \\ & \ddots & \\ 0 & & 1_{T_K} \end{pmatrix},$$

where $1_D \in \mathbb{R}^D$ is the D-dimensional vector with constant components equal to one,
and $T_j$ is the number of elements in cluster j. For any partition, we may re-order
(non uniquely) the data points so that the assignment matrix has the same form; this is
typically useful for the understanding of partitioning problems.



2.1 Distortion measure 

In this paper, we consider partitioning models where each data point in cluster j is
modelled by a vector (often called a centroid or a mean) $c_j \in \mathbb{R}^P$, the overall goal being to
find a partition and a set of means so that the distortion measure
$\sum_{i=1}^{T} \sum_{j=1}^{K} Y_{ij} \|x_i - c_j\|^2$ is as small as possible, where $\|\cdot\|$ is the
Euclidean norm in $\mathbb{R}^P$. By considering the Frobenius norm defined through
$\|A\|_F^2 = \sum_{i=1}^{T} \sum_{j=1}^{P} A_{ij}^2$, this is equivalent to minimizing

$$\|X - YC\|_F^2 \qquad (1)$$

with respect to an assignment matrix Y and the centroid matrix $C \in \mathbb{R}^{K \times P}$.



2.2 Representing partitions 

Following [3, 10], the quadratic minimization problem in C can be solved in closed
form, with solution $C = (Y^\top Y)^{-1} Y^\top X$ (it can be found by computing the matrix
gradient and setting it to zero). Thus, the partitioning problem (with known number of
clusters K) of minimizing the distortion in Eq. (1) is equivalent to:

$$\min_{Y \in \{0,1\}^{T \times K},\ Y 1_K = 1_T} \|X - Y (Y^\top Y)^{-1} Y^\top X\|_F^2. \qquad (2)$$



Thus, the problem is naturally parameterized by the $T \times T$ matrix $M = Y (Y^\top Y)^{-1} Y^\top$.
This matrix, which we refer to as a rescaled equivalence matrix, has a specific structure.
First, the matrix $Y^\top Y$ is diagonal, with j-th diagonal element equal to the number
of elements in cluster j. Thus $M_{ij} = 0$ if i and j are
in different clusters, and $M_{ij} = 1/D$ otherwise, where D is the number of elements in
the cluster containing the i-th data point. Thus, if the points are re-ordered so that the
segments are composed of contiguous elements, then we have the following form

$$M = \begin{pmatrix} 1_{T_1} 1_{T_1}^\top / T_1 & & 0 \\ & \ddots & \\ 0 & & 1_{T_K} 1_{T_K}^\top / T_K \end{pmatrix}.$$

In this paper, we use this representation of partitions. Note the difference with the
alternative representation $Y Y^\top$, which has values in {0, 1}, used in particular by [18].

We denote by $\mathcal{M}_K$ the set of rescaled equivalence matrices, i.e., matrices
$M \in \mathbb{R}^{T \times T}$ such that there exists an assignment matrix $Y \in \mathbb{R}^{T \times K}$ such that
$M = Y (Y^\top Y)^{-1} Y^\top$. For situations where the number of clusters is unspecified, we
denote by $\mathcal{M}$ the union of all $\mathcal{M}_K$ for $K \in \{1, \dots, N\}$.






Note that the number of clusters may be obtained from the trace of M, since
$\operatorname{Tr} M = \operatorname{Tr} Y (Y^\top Y)^{-1} Y^\top = \operatorname{Tr} (Y^\top Y)^{-1} Y^\top Y = K$. This can also be seen by
noticing that $M^2 = Y (Y^\top Y)^{-1} Y^\top Y (Y^\top Y)^{-1} Y^\top = M$, i.e., M is a projection
matrix, with eigenvalues in {0, 1}, and the number of eigenvalues equal to one is exactly
the number of clusters. Thus, $\mathcal{M}_K = \{M \in \mathcal{M},\ \operatorname{Tr} M = K\}$.
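
The properties of the rescaled equivalence matrix stated above can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names and the toy labelling are illustrative assumptions, not part of the paper.

```python
import numpy as np

def assignment_matrix(labels, n_clusters):
    """Return the T x K binary assignment matrix Y for labels in 0..K-1."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, n_clusters))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

def rescaled_equivalence(Y):
    """Return M = Y (Y^T Y)^{-1} Y^T, assuming every cluster is non-empty."""
    counts = Y.sum(axis=0)          # diagonal of Y^T Y, i.e. the cluster sizes
    return (Y / counts) @ Y.T

# Hypothetical partition of T = 6 points into K = 3 contiguous clusters.
labels = [0, 0, 0, 1, 1, 2]
Y = assignment_matrix(labels, 3)
M = rescaled_equivalence(Y)

print(np.trace(M))                  # number of clusters
print(np.allclose(M @ M, M))        # M is a projection matrix
```

Dividing the columns of Y by the cluster sizes avoids forming and inverting the diagonal matrix $Y^\top Y$ explicitly.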

Learning the number of clusters K. Given the number of clusters K, we have seen
from Eq. (2) that the partitioning problem is equivalent to

$$\min_{M \in \mathcal{M}_K} \|X - MX\|_F^2 = \min_{M \in \mathcal{M}_K} \operatorname{Tr}\big[ X X^\top (I - M) \big]. \qquad (3)$$

In change-point detection problems, an extra constraint of contiguity of segments is
added.

In the common situation when the number of clusters K is unknown, it may
be estimated directly from data by penalizing the distortion measure by a term
proportional to the number of clusters, as usually done for instance in change-point
detection [19]. This is a classical idea that can be traced back to the AIC criterion [1]
for instance. Given that the number of clusters for a rescaled equivalence matrix M is
$\operatorname{Tr} M$, this leads to the following formulation:

$$\min_{M \in \mathcal{M}} \operatorname{Tr}\big[ X X^\top (I - M) \big] + \lambda \operatorname{Tr} M. \qquad (4)$$

Note that our metric learning algorithm also learns this extra parameter $\lambda$.
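
To illustrate how the trace penalty of Eq. (4) selects a number of clusters, here is a small sketch that evaluates the penalized cost on three candidate partitions of a toy one-dimensional dataset. The data, the value of the penalty and the helper names are assumptions made for the example only.

```python
import numpy as np

def rescaled_equivalence(labels):
    """M_ij = 1/D if i and j share a cluster of size D, and 0 otherwise."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def distortion(X, M):
    """Distortion ||X - M X||_F^2 of the partition encoded by M."""
    return float(np.sum((X - M @ X) ** 2))

def penalized_cost(X, M, lam):
    """Objective of Eq. (4): Tr[X X^T (I - M)] + lam * Tr M."""
    T = X.shape[0]
    return float(np.trace(X @ X.T @ (np.eye(T) - M)) + lam * np.trace(M))

# Toy data with two well-separated groups (illustrative values).
X = np.array([[0.0], [0.1], [5.0], [5.1]])
candidates = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 4: [0, 1, 2, 3]}
costs = {k: penalized_cost(X, rescaled_equivalence(l), lam=1.0)
         for k, l in candidates.items()}
best_k = min(costs, key=costs.get)
print(costs, best_k)
```

With one cluster the distortion dominates, with four clusters the penalty dominates, and the two-cluster partition balances both terms.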

Thus, the two types of partitioning problems (with fixed or unknown number of
clusters) can be cast as the problem of maximizing a linear function of the form $\operatorname{Tr}(AM)$
with respect to $M \in \mathcal{M}$, with the potential constraint that $\operatorname{Tr} M = K$. In general,
such optimization problems may not be solved in polynomial time. In Section 2.3,
we show how adding contiguity constraints makes it possible to obtain a solution in
polynomial time through dynamic programming. For general situations, the K-means
algorithm, although not exact, can be used to get good partitioning in polynomial time.
In Section 2.4, we provide a spectral relaxation, which we use within our large-margin
framework in Section 4.



2.3 Change-point detection by dynamic programming 

The change-point detection problem is a restriction of the general partitioning problem
where the segments are composed of contiguous elements. We denote by $\mathcal{M}^{\mathrm{seq}}$ the set
of partition matrices for the change-point detection problem, and by $\mathcal{M}^{\mathrm{seq}}_K$ its restriction
to partitions with K segments.

The problem is thus that of solving Eq. (3) (known number of clusters) or Eq. (4)
(unknown number of clusters) with the extra constraint that $M \in \mathcal{M}^{\mathrm{seq}}$. In these two
situations, the contiguity constraint leads to exact polynomial-time algorithms based
on dynamic programming; see, e.g., [24]. This leads to algorithms for maximizing
$\operatorname{Tr}(AM)$, when A is positive semi-definite, in $O(T^2)$. When the number of segments
K is known, the running time complexity is $O(KT^2)$.

We now describe a reformulation that can solve $\max_{M \in \mathcal{M}^{\mathrm{seq}}} \operatorname{Tr}(AM)$ for any matrix
A (potentially with negative eigenvalues, as from Eq. (4)). This algorithm is presented
in Algorithm 1. It only requires some preprocessing of the input matrix A, namely
computing its summed area table I (or image integral), defined to have the same size
as A and with $I_{ij} = \sum_{i' \le i,\ j' \le j} A_{i'j'}$. In words, it is the sum of the elements of A
which are above and to the left of, respectively, i and j. A similar algorithm can be
derived in the case where $M \in \mathcal{M}^{\mathrm{seq}}_K$.



Algorithm 1 Dynamic programming for maximizing Tr(AM) such that $M \in \mathcal{M}^{\mathrm{seq}}$
Require: T x T matrix A

Compute I, image integral (summed area table) of A, with the convention I(0, :) = I(:, 0) = 0
(C(s, u) stores the best value over partitions of points 1, ..., u whose last segment is s, ..., u)
Initialize C(1, u) = I(u, u) / u for u = 1, ..., T
for t = 1 : T - 1 do
    for u = t + 1 : T do
        beta = (I(u, u) + I(t, t) - I(u, t) - I(t, u)) / (u - t)
        C(t + 1, u) = max(C(1 : t, t)) + beta
    end for
end for

Backtracking steps: t_c = T, Y = [ ] (empty matrix)
while t_c >= 1 do
    t_old = t_c, s = argmax(C(1 : t_old, t_old)), t_c = s - 1
    Y = blockdiag(1_{t_old - s + 1}, Y)
end while

return Matrix M = Y (Y^T Y)^{-1} Y^T.
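
The dynamic program can be sketched in Python as follows. This is one possible O(T^2) implementation built on a zero-padded summed area table; it follows the same idea as Algorithm 1 but keeps only the best value per prefix, and all names and the toy signal are assumptions made for illustration.

```python
import numpy as np

def max_trace_contiguous(A):
    """Maximize Tr(A M) over rescaled equivalence matrices of contiguous partitions.

    Returns (best value, list of segment start indices, M)."""
    A = np.asarray(A, dtype=float)
    T = A.shape[0]
    # Zero-padded summed area table: I[i, j] = sum of A[:i, :j].
    I = np.zeros((T + 1, T + 1))
    I[1:, 1:] = A.cumsum(axis=0).cumsum(axis=1)

    best = np.full(T + 1, -np.inf)      # best[u]: optimum for the first u points
    best[0] = 0.0
    start = np.zeros(T + 1, dtype=int)  # start[u]: start of the last segment
    for u in range(1, T + 1):
        for s in range(u):              # last segment covers points s .. u-1
            block = I[u, u] - I[s, u] - I[u, s] + I[s, s]
            value = best[s] + block / (u - s)
            if value > best[u]:
                best[u], start[u] = value, s

    # Backtrack the segment boundaries from the end of the sequence.
    starts, u = [], T
    while u > 0:
        starts.append(int(start[u]))
        u = start[u]
    starts = starts[::-1]

    M = np.zeros((T, T))
    for s, e in zip(starts, starts[1:] + [T]):
        M[s:e, s:e] = 1.0 / (e - s)
    return float(best[T]), starts, M

# Piecewise-constant toy signal; A = x x^T - lam * I corresponds to Eq. (4).
x = np.array([0, 0, 0, 5, 5, 5, 5, -3, -3], dtype=float)
lam = 1.0
A = np.outer(x, x) - lam * np.eye(x.size)
value, starts, M = max_trace_contiguous(A)
print(starts, value)
```

Because the block sums are read from the summed area table in constant time, the matrix A does not need to be positive semi-definite.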



2.4 K-means clustering and spectral relaxation 

For a known number of clusters K, K-means is an iterative algorithm aiming at
minimizing the distortion measure in Eq. (1): it iterates between (a) optimizing with respect
to C, i.e., $C = (Y^\top Y)^{-1} Y^\top X$, and (b) minimizing with respect to Y (by assigning
points to the closest centroids). Note that this algorithm only converges to a local
minimum and there is no known algorithm to perform an exact decoding in polynomial
time in high dimensions P. Moreover, the K-means algorithm cannot be readily
applied to approximately maximize any linear function $\operatorname{Tr} AM$ with respect to $M \in \mathcal{M}$,
i.e., when A is not positive-definite or the number of clusters is not known.

Following [25, 22, 3], we now present a spectral relaxation of this problem. This is
done by relaxing the set $\mathcal{M}$ to the set of matrices that satisfy $M^2 = M$ (i.e., removing
the constraint that M takes a finite number of distinct values). When the number of
clusters is known, this leads to the classical spectral relaxation, i.e.,

$$\max_{M \in \mathcal{M},\ \operatorname{Tr} M = K} \operatorname{Tr}(AM) \le \max_{M^2 = M,\ \operatorname{Tr} M = K} \operatorname{Tr}(AM),$$

which is equal to the sum of the K largest eigenvalues of A; the optimal matrix M
of the spectral relaxation is the orthogonal projector on the eigenvectors of A with K
largest eigenvalues.

When the number of clusters is unknown, we have:

$$\max_{M \in \mathcal{M}} \operatorname{Tr}(AM) \le \max_{M^2 = M} \operatorname{Tr}(AM) = \operatorname{Tr}(A)_+,$$

where $\operatorname{Tr}(A)_+$ is the sum of positive eigenvalues of A. The optimal matrix M of the
spectral relaxation is the orthogonal projector on the eigenvectors of A with positive
eigenvalues. Note that in the formulation from Eq. (4), this corresponds to thresholding
all eigenvalues of $X X^\top$ which are less than $\lambda$.

We denote by $\mathcal{M}^{\mathrm{spec}} = \{M \in \mathbb{R}^{T \times T},\ M^2 = M\}$ and
$\mathcal{M}^{\mathrm{spec}}_K = \{M \in \mathbb{R}^{T \times T},\ M^2 = M,\ \operatorname{Tr} M = K\}$ the relaxed sets of rescaled equivalence matrices.
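
A minimal sketch of the spectral relaxation, assuming a symmetric input matrix A, is given below. The function name and the random test matrix are illustrative; `numpy.linalg.eigh` returns eigenvalues in ascending order.

```python
import numpy as np

def spectral_relaxation(A, n_clusters=None):
    """Return (value, M) maximizing Tr(A M) over orthogonal projectors.

    If n_clusters is given, the projector has rank n_clusters; otherwise it
    spans the eigenvectors of A with positive eigenvalues."""
    A = (A + A.T) / 2.0                      # enforce symmetry (assumption)
    eigvals, eigvecs = np.linalg.eigh(A)     # ascending eigenvalues
    if n_clusters is not None:
        keep = np.argsort(eigvals)[::-1][:n_clusters]
    else:
        keep = np.flatnonzero(eigvals > 0)
    U = eigvecs[:, keep]
    return float(eigvals[keep].sum()), U @ U.T

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 6))
A = (G + G.T) / 2.0
value_k, M_k = spectral_relaxation(A, n_clusters=2)   # known K
value_pos, M_pos = spectral_relaxation(A)             # unknown K
print(value_k, value_pos)
```

The returned value is an upper bound on the best value reachable by any genuine partition, since every rescaled equivalence matrix is itself an orthogonal projector.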

2.5 Metric learning 

In this paper, we consider learning a Mahalanobis metric, which may be parameterized
by a positive definite matrix $B \in \mathbb{R}^{P \times P}$. This corresponds to replacing dot-products
$x_i^\top x_j$ by $x_i^\top B x_j$, and $X X^\top$ by $X B X^\top$. Thus, when the number of clusters is known,
this corresponds to

$$\min_{M \in \mathcal{M}_K} \operatorname{Tr}\big[ X B X^\top (I - M) \big] \qquad (5)$$

or, when the number of clusters is unknown, to:

$$\min_{M \in \mathcal{M}} \operatorname{Tr}\big[ B X^\top (I - M) X \big] + \lambda \operatorname{Tr} M. \qquad (6)$$

Note that by replacing B by $\lambda B$ and dividing the equation by $\lambda$, we may use an
equivalent formulation of Eq. (6) with $\lambda = 1$, that is:

$$\min_{M \in \mathcal{M}} \operatorname{Tr}\big[ X B X^\top (I - M) \big] + \operatorname{Tr} M. \qquad (7)$$

The key aspect of the partitioning problem is that it is formulated as optimizing with
respect to M a function linearly parameterized by B. The linear parametrization in M
will be useful when defining proper losses and efficient loss-augmented inference in
Section 4.

Note that we may allow B to be just positive semi-definite. In that case, the zero
eigenvalues of the pseudo-metric correspond to irrelevant dimensions. This means in
particular that we have performed dimensionality reduction on the input data. We propose
a simple way to encourage this desirable property in Section 4.3.
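
As a sanity check on the Mahalanobis formulation, the sketch below verifies that the cost $\operatorname{Tr}[X B X^\top (I - M)]$ with $B = L L^\top$ equals the Euclidean distortion of the linearly mapped data $XL$. The factor L, the random data and the function names are assumptions introduced for the example.

```python
import numpy as np

def rescaled_equivalence(labels):
    """M_ij = 1/D if i and j share a cluster of size D, and 0 otherwise."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def metric_cost(X, B, M):
    """Partitioning cost under the metric B: Tr[X B X^T (I - M)]."""
    T = X.shape[0]
    return float(np.trace(X @ B @ X.T @ (np.eye(T) - M)))

def distortion(X, M):
    """Euclidean distortion ||X - M X||_F^2."""
    return float(np.sum((X - M @ X) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
L = rng.standard_normal((3, 2))
B = L @ L.T                          # positive semi-definite, rank at most 2
M = rescaled_equivalence([0, 0, 0, 1, 1, 1])

cost_metric = metric_cost(X, B, M)
cost_mapped = distortion(X @ L, M)   # same cost after mapping the data by L
print(cost_metric, cost_mapped)
```

A rank-deficient B therefore amounts to clustering in a lower-dimensional projection of the data.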



3 Loss between partitions 

Before going further and applying the framework of structured prediction [29] in the
context of metric learning, we need to find a loss on the output space of possible partitions
which is well suited to our context. To avoid any notation conflict, we will refer in
this section to $\mathcal{P}$ as a general set of partitions (it can correspond for instance to $\mathcal{M}^{\mathrm{seq}}$).

3.1 Some standard losses

The Rand index. When comparing partitions [16], a standard way to measure how
different two of them are is to use the Rand index [23], which is defined, for two
partitions $P_1 = \{P_1^1, \dots, P_1^{K_1}\}$ and $P_2 = \{P_2^1, \dots, P_2^{K_2}\}$ of the same set S of T elements, as
the number of concordant pairs divided by the number of possible pairs. More precisely,
among all possible pairs of elements of S, the concordant pairs are those whose two
elements belong to the same set in both $P_1$ and $P_2$, together with those whose two
elements belong to different sets in both $P_1$ and $P_2$. In matrix terms, it is linked to
the Frobenius distance between the equivalence matrices representing $P_1$ and $P_2$ (these
matrices are binary matrices of size T x T whose entry (i, j) is 1 if and only if elements
i and j belong to the same set of the partition).

This loss is not necessarily well suited to our problem, since it does not take into
account the size of each subset inside the partition, whereas
our concern is to optimize the intra-class variance, which is a rescaled indicator.

Hausdorff distance. In the change-point detection literature, a very common way to
measure dissimilarities between partitions is the so-called Hausdorff distance [6] on the
elements of the frontier of the elements of the partitions (the need for a frontier makes
it inapplicable directly to the case of general clustering). Let us consider two partitions
of a finite set S of T elements. We assume that the elements have a sequential order
and thus elements of partitions $P_1$ and $P_2$ have to be contiguous. It is then possible
to define the frontier (or set of ruptures) of $P_1$ as the collection of indexes
$\partial P_1 = \{\inf P_1^1, \dots, \inf P_1^{K_1}\}$. Then, by embedding the set S into [0, 1] (this simply amounts to
normalizing the time indexes so that they are in [0, 1]), we can consider a distance d on
[0, 1] (typically the absolute value) and then define the associated Hausdorff distance
$d_H(P_1, P_2) = \max\{ \sup_{x \in \partial P_1} \inf_{y \in \partial P_2} d(x, y),\ \sup_{y \in \partial P_2} \inf_{x \in \partial P_1} d(x, y) \}$.
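
The two standard losses above can be computed as in the following sketch, which assumes partitions are given as label sequences and, for the Hausdorff distance, that time indexes are rescaled by dividing by T. These conventions and the function names are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def rand_index(labels1, labels2):
    """Fraction of pairs on which the two partitions agree (same/different)."""
    pairs = list(combinations(range(len(labels1)), 2))
    agree = sum((labels1[i] == labels1[j]) == (labels2[i] == labels2[j])
                for i, j in pairs)
    return agree / len(pairs)

def hausdorff_changepoints(labels1, labels2):
    """Hausdorff distance between segment frontiers, with time rescaled to [0, 1)."""
    def frontier(labels):
        labels = np.asarray(labels)
        # First index of each contiguous segment (the infimum of the segment).
        starts = np.flatnonzero(np.r_[True, labels[1:] != labels[:-1]])
        return starts / len(labels)
    f1, f2 = frontier(labels1), frontier(labels2)
    d = np.abs(f1[:, None] - f2[None, :])
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

# Two contiguous partitions of T = 8 points that differ on one element.
p1 = [0, 0, 0, 1, 1, 1, 2, 2]
p2 = [0, 0, 1, 1, 1, 1, 2, 2]
ri = rand_index(p1, p2)
hd = hausdorff_changepoints(p1, p2)
print(ri, hd)
```

Here the second change-point moves by one time step out of eight, so the Hausdorff distance is 1/8, while 5 of the 28 pairs become discordant.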

The loss considered in our context. In this paper, we consider the following loss,
which was originally proposed in a slightly different form by [16] and has since been
widely used in the field of clustering [3]. This loss is a variation of the $\chi^2$ association in
a $K_1 \times K_2$ contingency table (see [16]). More precisely, if we consider the contingency
table associated with $P_1$ (a partition of a set of size T with $K_1$ elements) and $P_2$ (with $K_2$
elements), i.e., the $K_1 \times K_2$ table C such that $C_{ij} = n_{ij}$ is
the number of elements in both element i of $P_1$ and element j of $P_2$, and denote by M and N
the rescaled equivalence matrices of $P_1$ and $P_2$, we have that
$\|M - N\|_F^2 = K_1 + K_2 - 2 \sum_{i,j} \frac{n_{ij}^2}{|P_1^i| \, |P_2^j|}$. We define

$$\ell(M, N) = \frac{1}{T} \|M - N\|_F^2 = \frac{1}{T} \big( \operatorname{Tr}(M) + \operatorname{Tr}(N) - 2 \operatorname{Tr}(MN) \big). \qquad (8)$$

Moreover, if the partitions encoded by M and N have clusters $P_1^1, \dots, P_1^{K_1}$ and
$P_2^1, \dots, P_2^{K_2}$, then $T \ell(M, N) = K_1 + K_2 - 2 \sum_{k,l} \frac{|P_1^k \cap P_2^l|^2}{|P_1^k| \, |P_2^l|}$. This loss is equal to
zero if the partitions are equal, and never exceeds $\frac{1}{T}(K_1 + K_2 - 2)$. Another equivalent
interpretation of this index is given by, with the usual convention that for the element
of S indexed by i, $P_1(i)$ is the subset of $P_1$ to which i belongs:

$$T \ell(M, N) = K_1 + K_2 - 2 \sum_{i=1}^{T} \frac{|P_1(i) \cap P_2(i)|}{|P_1(i)| \times |P_2(i)|}.$$



This index seems intuitively much more suited to the study of the problem of variance
minimization, since it involves the rescaled equivalence matrices which naturally
parametrize these kinds of problems. We examine in the Appendix more facts about these
losses and their links, especially about the asymptotic behaviour of the loss we use
in the paper. We also show a link between this loss and the Hausdorff distance in the case of
change-point detection.
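
The equivalence between the Frobenius form of the loss in Eq. (8) and its contingency-table form can be checked on a small example. The sketch below uses label sequences as input; the function names and the example partitions are illustrative assumptions.

```python
import numpy as np

def rescaled_equivalence(labels):
    """M_ij = 1/D if i and j share a cluster of size D, and 0 otherwise."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def partition_loss(M, N):
    """Loss of Eq. (8): squared Frobenius distance divided by T."""
    return float(np.sum((M - N) ** 2) / M.shape[0])

def partition_loss_from_counts(labels1, labels2):
    """Same loss computed from the contingency table of the two partitions."""
    l1, l2 = np.asarray(labels1), np.asarray(labels2)
    total = 0.0
    for a in np.unique(l1):
        for b in np.unique(l2):
            n_ab = np.sum((l1 == a) & (l2 == b))
            total += n_ab ** 2 / (np.sum(l1 == a) * np.sum(l2 == b))
    k1, k2 = np.unique(l1).size, np.unique(l2).size
    return float((k1 + k2 - 2.0 * total) / l1.size)

p1 = [0, 0, 0, 1, 1, 1, 2, 2]
p2 = [0, 0, 1, 1, 1, 1, 2, 2]
loss_frobenius = partition_loss(rescaled_equivalence(p1), rescaled_equivalence(p2))
loss_counts = partition_loss_from_counts(p1, p2)
print(loss_frobenius, loss_counts)
```

The contingency-table form avoids building the two T x T matrices, which can matter for long sequences.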



4 Structured prediction for metric learning 

As shown in the previous section, our goal is to learn a positive definite matrix B, in
order to improve the performance of the structured output algorithm that minimizes, with
respect to $M \in \mathcal{M}$, the cost function of Eq. (7). Using the change of variable
described in the table below, the partitioning problem may be cast as

$$\max_{M \in \mathcal{M}} \langle w, \varphi(X, M) \rangle \quad \text{or} \quad \max_{M \in \mathcal{M}_K} \langle w, \varphi(X, M) \rangle,$$

where $\langle A, B \rangle$ is the Frobenius dot product.



Number of clusters        $\varphi(X, M)$                      $w$
Known (Tr M = K)          $\frac{1}{T} X^\top M X$              $B$
Unknown                   $\frac{1}{T} (X^\top M X,\ M)$        $(B,\ -I)$



We denote by $\mathcal{F}$ the vector space to which the vector w defined above belongs. Our
goal is thus to estimate $w \in \mathcal{F}$ from pairs of observations $(X_i, M_i) \in \mathcal{X} \times \mathcal{M}$. This
is exactly the goal of large-margin structured prediction [29], which we now present.
We denote by $\mathcal{N}$ a generic set of matrices, which may be either $\mathcal{M}$, $\mathcal{M}^{\mathrm{spec}}$, $\mathcal{M}^{\mathrm{seq}}$,
$\mathcal{M}_K$, $\mathcal{M}^{\mathrm{spec}}_K$ or $\mathcal{M}^{\mathrm{seq}}_K$, depending on the situation (see Section 4.2 for specific cases).



4.1 Large-margin structured output learning 

In the margin-rescaling framework of [29], using a certain loss $\ell : \mathcal{N} \times \mathcal{N} \to \mathbb{R}_+$
between elements of $\mathcal{N}$ (here partitions), the goal is to minimize with respect to $w \in \mathcal{F}$,

$$\frac{1}{N} \sum_{i=1}^{N} \ell\Big( \arg\max_{M \in \mathcal{N}} \langle w, \varphi(X_i, M) \rangle,\ M_i \Big) + \Omega(w),$$

where $\Omega$ is any (typically convex) regularizer. This framework is standard in machine
learning in general and metric learning in particular (see, e.g., [17]). The loss function
$w \mapsto \ell\big( \arg\max_{M \in \mathcal{N}} \langle w, \varphi(X_i, M) \rangle,\ M_i \big)$ is not convex in w, and may be replaced
by the convex surrogate

$$L_i(w) = \max_{M \in \mathcal{N}} \big\{ \ell(M, M_i) + \langle w, \varphi(X_i, M) - \varphi(X_i, M_i) \rangle \big\},$$

leading to the minimization of

$$\frac{1}{N} \sum_{i=1}^{N} L_i(w) + \Omega(w). \qquad (9)$$

In order to apply this framework, several elements are needed: (a) a regularizer $\Omega$,
(b) a loss function $\ell$, and (c) the associated efficient algorithms for computing $L_i$, i.e.,
solving the loss-augmented inference problem
$\max_{M \in \mathcal{N}} \{ \ell(M, M_i) + \langle w, \varphi(X_i, M) - \varphi(X_i, M_i) \rangle \}$.

As discussed in Section 3, a natural loss on our output space is given by the squared
Frobenius distance between the rescaled equivalence matrices associated with partitions.

4.2 Loss-augmented inference problem 

Efficient loss-augmented inference is key to the applicability of large-margin structured
prediction, and this problem is a classical computational bottleneck. In our situation the
cardinality of $\mathcal{N}$ is exponential, but the choice of loss between partitions leads to the problem
$\max_{M \in \mathcal{N}} \operatorname{Tr}(A_i M)$ where:

- $A_i = \frac{1}{T} (X_i B X_i^\top - 2 M_i + I)$ if the number of clusters is known.

- $A_i = \frac{1}{T} (X_i B X_i^\top - 2 M_i)$ otherwise.

Thus, the loss-augmented problem may be solved exactly for change-point problems
(see Section 2.3) or through a spectral relaxation otherwise (see Section 2.4).
Namely, for change-point detection problems, $\mathcal{N}$ is either $\mathcal{M}^{\mathrm{seq}}$ or $\mathcal{M}^{\mathrm{seq}}_K$, while for
general partitioning problems, it is either $\mathcal{M}^{\mathrm{spec}}$ or $\mathcal{M}^{\mathrm{spec}}_K$.
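
The following sketch checks numerically that maximizing $\operatorname{Tr}(A_i M)$ is the same as maximizing the loss-augmented score, up to a constant that does not depend on M. It assumes the score is $\frac{1}{T} \operatorname{Tr}(B X^\top M X)$ when K is known, minus $\frac{1}{T} \operatorname{Tr} M$ when K is unknown; this scaling is inferred from the expressions of $A_i$, and all names are illustrative.

```python
import numpy as np

def rescaled_equivalence(labels):
    """M_ij = 1/D if i and j share a cluster of size D, and 0 otherwise."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def loss_augmented_matrix(X, B, M_true, known_k):
    """Matrix A_i such that the loss-augmented problem is max_M Tr(A_i M)."""
    T = X.shape[0]
    A = X @ B @ X.T - 2.0 * M_true
    if known_k:
        A = A + np.eye(T)
    return A / T

rng = np.random.default_rng(0)
T = 6
X = rng.standard_normal((T, 3))
B = np.diag([1.0, 0.5, 0.0])                       # hypothetical diagonal metric
M_true = rescaled_equivalence([0, 0, 0, 1, 1, 1])  # ground-truth partition
M = rescaled_equivalence([0, 0, 1, 1, 2, 2])       # candidate partition

loss = np.sum((M - M_true) ** 2) / T
score_known = np.trace(B @ X.T @ M @ X) / T
score_unknown = score_known - np.trace(M) / T
const = np.trace(M_true) / T                       # independent of M

A_known = loss_augmented_matrix(X, B, M_true, known_k=True)
A_unknown = loss_augmented_matrix(X, B, M_true, known_k=False)
print(np.trace(A_known @ M), loss + score_known - const)
```

Since the objective is linear in M, the dynamic program of Section 2.3 or the spectral relaxation of Section 2.4 can be applied to $A_i$ directly.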

4.3 Regularizer 

We may consider several parametrizations/regularizers for our positive semidefinite
matrix B. We may classically (see, e.g., [17]) penalize $\operatorname{Tr} B^2 = \|B\|_F^2$, which is the
classical squared Euclidean (Frobenius) norm. However, two variants of our algorithm are often
needed for practical problems.

Diagonal metric. To limit the number of parameters, we may be interested in only
reweighting the different dimensions of the input data, i.e., we can impose the metric
to be diagonal, i.e., $B = \operatorname{Diag}(b)$ where $b \in \mathbb{R}^P$. Then, the constraint is $b \ge 0$, and we
may penalize by $\|b\|_1 = 1_P^\top b$ or $\|b\|_2^2$, depending on whether we want to promote zeros in
b (i.e., to do feature selection).

Low-rank metric. Another potentially desirable property is the interpretability of
the obtained metric in terms of its eigenvectors. Ideally we want to have a pseudo-metric
with a small rank. As is classically done, we relax the rank into the sum of singular
values. Here, since the matrix B is symmetric positive semidefinite, this is simply the trace
$\operatorname{Tr}(B)$.

4.4 Optimization

In order to optimize the objective function of Eq. (9), we can use several optimization
techniques. This objective has the drawback of being non-smooth, so fast
convergence cannot be expected.

In the structured prediction literature, the most common solvers are based on cutting-plane
methods (see [29]), which can be used in our case for small-dimensional problems
(i.e., low P). Otherwise we use a projected subgradient method, which leads to more
numerous but cheaper iterations. Cutting-plane and bundle methods [28] show the
best speed performance when the dimension of the feature space of the data to
partition is low, but were empirically outperformed by the projected subgradient method
in the very high dimensional setting.
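
A single projected subgradient step for one dataset in the known-K case can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes a squared Frobenius regularizer with weight `reg`, takes the loss-augmented maximizer `M_hat` as an input computed elsewhere, and projects onto the positive semidefinite cone by clipping eigenvalues.

```python
import numpy as np

def rescaled_equivalence(labels):
    """M_ij = 1/D if i and j share a cluster of size D, and 0 otherwise."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def project_psd(B):
    """Euclidean projection of a symmetric matrix onto the PSD cone."""
    B = (B + B.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(B)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def subgradient_step(B, X, M_true, M_hat, step, reg=0.0):
    """One projected subgradient step on L_i(B) + reg * ||B||_F^2.

    M_hat is assumed to maximize the loss-augmented problem for the current B."""
    T = X.shape[0]
    g = X.T @ (M_hat - M_true) @ X / T + 2.0 * reg * B
    return project_psd(B - step * g)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
M_true = rescaled_equivalence([0, 0, 0, 1, 1, 1])
M_hat = rescaled_equivalence([0, 0, 1, 1, 2, 2])  # stand-in for the maximizer
B0 = np.eye(3)
B1 = subgradient_step(B0, X, M_true, M_hat, step=0.1, reg=0.01)
print(np.linalg.eigvalsh(B1))
```

For a diagonal metric, the projection reduces to clipping the entries of b at zero.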



5 Extensions 

We now present extensions which make our metric learning more generally applicable. 

5.1 Spectral clustering and normalized cuts 

Normalized cut segmentation is a graph-based formulation for clustering aiming at
finding roughly balanced cuts in graphs [25]. The input data X is now replaced by a
similarity matrix $W \in \mathbb{R}_+^{T \times T}$ and, for a known number of clusters K, as shown by
[22, 3], it is exactly equivalent to

$$\max_{M \in \mathcal{M}_K} \operatorname{Tr} M \widetilde{W},$$

where $\widetilde{W} = \operatorname{Diag}(W 1)^{-1/2}\, W\, \operatorname{Diag}(W 1)^{-1/2}$ is the normalized similarity matrix.

Parametrization of the similarity matrix W. Typically, given data points $x_1, \dots, x_T \in \mathbb{R}^P$
(in image segmentation problems, these are often the concatenation of the positions
in the image and local feature vectors), the similarity matrix is computed as

$$(W_B)_{ij} = \exp\big( -(x_i - x_j)^\top B (x_i - x_j) \big), \qquad (10)$$

where B is a positive semidefinite matrix. Learning the matrix B is thus of key practical
importance.
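
The similarity matrix of Eq. (10) and its normalized version can be computed as in this sketch. The function names and the toy points are assumptions; the broadcasting builds all pairwise differences at once, which is only practical for moderate T.

```python
import numpy as np

def similarity_matrix(X, B):
    """(W_B)_ij = exp(-(x_i - x_j)^T B (x_i - x_j)) for all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]                 # shape (T, T, P)
    quad = np.einsum('ijp,pq,ijq->ij', diff, B, diff)    # pairwise quadratic forms
    return np.exp(-quad)

def normalized_similarity(W):
    """Diag(W 1)^{-1/2} W Diag(W 1)^{-1/2}."""
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
B = np.diag([1.0, 0.25])          # hypothetical diagonal metric
W = similarity_matrix(X, B)
W_norm = normalized_similarity(W)
print(W)
```

With this metric the second coordinate is down-weighted, so the points (1, 0) and (0, 2) are equally similar to the origin.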

However, our formulation would lead to efficiently learning (as a convex optimization
problem) parameters only for a linear parametrization of W. While the linear
combination is attractive computationally, we follow the experience from the
supervised setting where learning linear combinations of kernels, while formulated as a
convex problem, does not significantly improve on methods that learn the metric within a
Gaussian kernel with non-convex approaches (see, e.g., [12, 20]).



We thus stick to the parametrization of Eq. (10). In order to make the problem
simpler and more tractable, we consider spectral clustering directly with W and not
with its normalized version, i.e., our partitioning problem becomes

$$\max_{M \in \mathcal{M}} \operatorname{Tr} W M \quad \text{or} \quad \max_{M \in \mathcal{M}_K} \operatorname{Tr} W M.$$

In order to solve the previous problem, the spectral relaxation outlined in Section 2.4
may be used, and corresponds to computing the eigenvectors of W (the first K ones if
K is known, and the ones corresponding to eigenvalues greater than a certain threshold
otherwise).

Non-convex optimization. In our structured output prediction formulation, the loss
function for the i-th observation becomes (for the case where the number of clusters is
known):

$$\max_{M \in \mathcal{M}_K} \big\{ \ell(M, M_i) + \operatorname{Tr} W_B (M - M_i) \big\}
= -\operatorname{Tr} W_B M_i + \max_{M \in \mathcal{M}_K} \big\{ \ell(M, M_i) + \operatorname{Tr} W_B M \big\}.$$

It is not a convex function of B; however, it is the sum of a concave and a convex
function, which can be dealt with using a majorization-minimization algorithm [33]. The
idea of this algorithm is simply to upper-bound the concave part $-\operatorname{Tr} W_B M_i$ by its
linear tangent. Then the problem becomes convex and can be optimized using one of
the methods proposed in Section 4.4. We then iterate the process, which is known to
converge to a stationary point.



5.2 Partial labellings 

The large-margin convex optimization framework relies on fully labelled datasets, i.e.,
pairs $(X_i, M_i)$ where $X_i$ is a dataset and $M_i$ the corresponding rescaled equivalence
matrix. In many situations however, only partial information is available. In these
situations, starting from the PCA metric, we propose to iterate between (a) labelling all
datasets using the current metric while respecting the constraints imposed by the partial
labels and (b) learning the metric as in Section 4 from the fully labelled datasets. See an
application in Section 6.1.



5.3 Detecting changes in distribution of temporal signals 

In sequential problems, so far we are only able to detect changes in the mean of
the distribution of time series, but not change-points in the whole distribution
(e.g., the mean may be constant but the variance piecewise constant). Let us
consider a time series X in which some breakpoints occur in the distribution of
the data. From this single series, we build several series that make it possible to detect these
changes, by considering features built from X in which the change of distribution
appears as a change in mean. A naive way would be to consider the moments of
the data $X, X^2, X^3, \dots, X^r$, but unfortunately these moments explode as r grows.
A way to prevent them from exploding is to use the robust Hermite moments [31].
These moments are computed using the Hermite functions and make it possible to consider the
p-dimensional series $H_1(X), H_2(X), \dots$, where $H_i(X)$ is the i-th Hermite function

Hi(x) = 2V2 l 7ri!e"^(-l) i 2 l / 2 e^ £i(e^). 
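
As an illustration of Hermite-based features, the sketch below evaluates the standard orthonormal (physicists') Hermite functions with their three-term recurrence. The normalization and the indexing from 0 are assumptions of this sketch and may differ from the convention of [31]; the function name and the sample values are illustrative.

```python
import numpy as np

def hermite_features(x, order):
    """Return an array of shape (len(x), order) holding psi_0 .. psi_{order-1}.

    psi_n are the orthonormal Hermite functions, computed by the recurrence
    psi_{n+1} = sqrt(2/(n+1)) * x * psi_n - sqrt(n/(n+1)) * psi_{n-1}."""
    x = np.asarray(x, dtype=float)
    psi = np.empty((order, x.size))
    psi[0] = np.pi ** -0.25 * np.exp(-x ** 2 / 2.0)
    if order > 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for n in range(1, order - 1):
        psi[n + 1] = (np.sqrt(2.0 / (n + 1)) * x * psi[n]
                      - np.sqrt(n / (n + 1)) * psi[n - 1])
    return psi.T

series = np.array([0.5, 3.0, 10.0])
raw_moments = np.stack([series ** k for k in range(1, 6)], axis=1)
features = hermite_features(series, order=5)
print(np.abs(raw_moments).max(), np.abs(features).max())
```

Unlike raw powers, these features stay bounded for large inputs, which is the property the text relies on.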

Bioinformatics application. Detection of change-points in DNA sequences for cancer
prognosis provides a natural testbed for this approach. Indeed, in this field,
researchers face data which are linked to the number of copies of each gene along the
DNA (a-CGH data as used in [15]). The presence of such changes is generally related
to the development of certain types of cancers. On the data from the Neuroblastoma
dataset [15], some karyotypes with changes of distribution were manually annotated.
Without any metric learning, the global error rate in change-point identification is 12%.
By considering the first 5 Hermite moments and learning a metric, we reach a rate of
6.9%, thus improving significantly the performance.

6 Experiments 

We have conducted a series of experiments showing improvements of our large-margin 
metric learning methods over previous metric learning techniques. 

6.1 Change point detection 

Synthetic examples and robustness to lack of information. We consider 300- 
dimensional time series of length T = 600 with an unknown number of breakpoints. 
Among these series only 10 are relevant to the problem of change-point detection, i.e., 
290 series have abrupt changes which should be discarded. Since the identity of the 10 
relevant time series is unknown, by learning a metric we hope to obtain high weights 
on the relevant series and small weights on the others. The number of segments is not 
assumed to be known and is learned automatically. 
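
A minimal sketch of how such a synthetic dataset could be generated is given below. The noise level, the number of change-points and the Gaussian segment levels are assumptions made for illustration only (the text does not specify them), and `make_synthetic_series` is a hypothetical name.

```python
import numpy as np


def make_synthetic_series(T=600, n_relevant=10, n_noise=290, n_changes=5, seed=0):
    """Return a (T, n_relevant + n_noise) series and the shared change-points.

    The relevant coordinates all change at the same (true) change-points; every
    other coordinate has its own abrupt changes, unrelated to the ground truth.
    """
    rng = np.random.default_rng(seed)

    def random_change_points():
        return np.sort(rng.choice(np.arange(1, T), size=n_changes, replace=False))

    def piecewise_constant(cps):
        levels = rng.normal(0.0, 2.0, size=len(cps) + 1)  # assumed level scale
        segment_id = np.searchsorted(cps, np.arange(T), side="right")
        return levels[segment_id]

    true_cps = random_change_points()
    columns = [piecewise_constant(true_cps) for _ in range(n_relevant)]
    columns += [piecewise_constant(random_change_points()) for _ in range(n_noise)]
    X = np.stack(columns, axis=1)
    X += rng.normal(0.0, 1.0, size=X.shape)  # assumed unit-variance noise
    return X, true_cps


X, true_cps = make_synthetic_series()
print(X.shape, true_cps)
```

In this sketch the relevant coordinates are simply the first ten columns; shuffling the columns would hide their identity from the learner, as in the experiment.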

Moreover, in this experiment we progressively remove information, in the sense
that we only give a fraction of the original time series as input to the algorithm
(we measure the amount of information given through the ratio of the length of the
given time series to that of the original one). Results are presented in Figure 1.
As expected, the performance without metric learning is poor, and it improves with
PCA. Techniques such as RCA [4], which use the labels, improve it even more (all
datasets were stacked into a single one with the corresponding supervision);
however, RCA is not directly adapted to change-point detection, it requires
dimensionality reduction to work, and its performance is not robust to the choice
of the number of dimensions. Note also that all methods except ours are given the
exact number of change-points. Our large-margin approach outperforms the other
metrics in the convex setting (i.e., extreme right of the curves), but also in the
partially-supervised setting, where we use the alternative approach described in
Section lBT2l.

Video segmentation. We applied our method to data coming from old TV shows
(the length of the time series in that case is about 5400, with 60 to 120
change-points) where some speaking passages alternate with singing ones. The
videos are from 1h up to 1h30 long. We aim at recovering the segmentation induced
by the speaking parts and the musical ones. Following [2], we use GIST features
for the video part and MFCC features for the audio. The features were aggregated
every second so that the time series we are considering are several thousand
vectors long, which is

Figure 1: Performance on synthetic data vs. the quantity of information available
in the time series. Note the small error bars. We compare ourselves against a
metric learned by RCA (with 3 or 4 components), an exhaustive search for one
regularization parameter, and PCA.

Table 1: Empirical performance on each of the three TV shows used for testing.
Each subcolumn stands for a different TV show. The smaller the loss is, the better
the segmentation is.

Method             Audio                Video                Both
PCA                23     41     34     40     55     25     29     53     37
Reg. parameter     29     48     33     59     55     47     40     48     36
Metric learning    6.1    9.3    7      10     14     11     8.7    9.6    7.8

Table 2: Performance of the metric learning versus the Euclidean distance, and
other metric learning algorithms such as RCA or [32]. We use the loss from Eq. (8).

Dataset        Ours           Euclidean              RCA            [32]
Iris           0.18 ± 0.01    0.55 ± 10^{-11}        0.43 ± 0.02    0.30 ± 0.01
Wine           1.03 ± 0.04    3.4 ± 3·10^{-4}        0.88 ± 0.14    3.08 ± 0.1
Letters        34.5 ± 0.1     41.62 ± 0.2            34.8 ± 0.5     35.26 ± 0.1
Mov. Libras    14 ± 1         15 ± 0.2               22 ± 2         15.07 ± 1

still computationally tractable using the dynamic programming of Algorithm 1. We
used 4 shows for training, 3 for validation and 3 for testing. The running times
of our Matlab implementation were on the order of a few hours.
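
To make the role of dynamic programming concrete, here is a generic sketch of least-squares segmentation under a diagonal metric. It is not claimed to be the paper's Algorithm 1: the number of segments is fixed here, the metric is restricted to nonnegative per-coordinate weights, and `segment_dp` is a hypothetical name.

```python
import numpy as np


def segment_dp(X, n_segments, weights=None):
    """Best segmentation of X (T x p) into n_segments contiguous segments.

    Minimizes the weighted within-segment squared distortion in O(K T^2) time.
    Returns the sorted change-points (first index of each new segment) and the
    optimal cost.
    """
    X = np.asarray(X, dtype=float)
    T, p = X.shape
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    # Prefix sums give the distortion of any segment in O(p).
    S1 = np.vstack([np.zeros(p), np.cumsum(X, axis=0)])
    S2 = np.vstack([np.zeros(p), np.cumsum(X ** 2, axis=0)])

    def cost(s, t):
        # Weighted distortion of rows s..t-1 around their mean.
        n = t - s
        seg_sum = S1[t] - S1[s]
        seg_sq = S2[t] - S2[s]
        return float(np.dot(w, seg_sq - seg_sum ** 2 / n))

    best = np.full((n_segments + 1, T + 1), np.inf)
    arg = np.zeros((n_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                c = best[k - 1, s] + cost(s, t)
                if c < best[k, t]:
                    best[k, t] = c
                    arg[k, t] = s
    # Backtrack the segment starts; the first one is always 0 and is dropped.
    starts, t = [], T
    for k in range(n_segments, 0, -1):
        t = int(arg[k, t])
        starts.append(t)
    return sorted(starts[:-1]), float(best[n_segments, T])


# Coordinate 0 carries the true changes (t = 20, 40); coordinate 1 is a distractor.
X = np.zeros((60, 2))
X[20:40, 0] = 5.0
X[30:, 1] = 20.0
cps_uniform, _ = segment_dp(X, 3)
cps_weighted, cost_weighted = segment_dp(X, 3, weights=[1.0, 0.0])
print(cps_uniform, cps_weighted)
```

With uniform weights the large jump of the distractor coordinate attracts a change-point, whereas putting zero weight on it recovers the true segmentation; this is the effect that learning the metric aims for.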

The results are described in Table 1. We consider three different settings: using
only the image stream, only the audio stream, or both. In these three cases, we
consider using the existing metric (no learning), PCA, or our approach. In all
settings, metric learning improves performance. Note that the performance is best
with only the audio stream; given both streams, our metric learning manages to do
almost as well as with only the audio stream, thus illustrating the robustness of
using metric learning in this context.

6.2 K-means clustering 

Using the partition induced by the classes as ground truth, we tested our
algorithm on some classification datasets from the UCI machine learning
repository, following the methodology proposed by [32]. This application of our
framework is a little extreme in the sense that we assume only one partitioning as
training point (i.e., N = 1). The results are presented in Table 2. For the
"Letters" and "Mov. Libras" datasets, there are no significant differences, while
for the "Wine" dataset, RCA is the best, and for the "Iris" dataset, our
large-margin approach is best: even in this extreme case, we are competitive with
existing techniques.

Table 3: Performance of the metric learned in the context of image segmentation,
comparing the result of a learned metric vs. the results of an exhaustive grid
search (Grid). $\sigma$ is the standard deviation of the difference between the
loss with our metric and the grid search. To assess the significance of our
results, we perform t-tests whose p-values are respectively $2 \cdot 10^{-9}$ and
$4 \cdot 10^{-9}$.

Loss used           Learned metric    Grid    $\sigma$
Loss of Eq. (8)     1.54              1.77    0.3
Jaccard distance    0.45              0.53    0.11


6.3 Image Segmentation 

We now consider learning metrics for normalized cuts on the Weizmann horses
database [5], for which groundtruth segmentation is available. Using color and
position features, we learn a metric with the method presented in Section 5.1 on
10 fully labelled images. We then test on the remaining 318 images.

We compare the results of this procedure to a cross-validation approach with an
exhaustive search on a 2D grid, adjusting one parameter for the position features
and another for the color ones. The loss between groundtruth and segmentations
obtained by the normalized cuts algorithm is measured either by Eq. (8) or by the
Jaccard distance. Results are summarized in Table 3, with some visual examples in
Figure 2. The metric learning within the Gaussian kernel significantly improves
performance. The running times of our pure Matlab implementation were on the order
of several hours to reach convergence of the convex-concave procedure we used.
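
For reference, the Jaccard distance between two binary segmentations can be computed as in the sketch below. It assumes foreground/background masks of equal shape, which is one plausible reading of the evaluation and is not spelled out in the text; `jaccard_distance` is a hypothetical name.

```python
import numpy as np


def jaccard_distance(mask_a, mask_b):
    """1 - |A ∩ B| / |A ∪ B| for two boolean foreground masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty masks are treated as identical
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union


predicted = np.array([[1, 1], [0, 0]])
groundtruth = np.array([[1, 0], [0, 0]])
print(jaccard_distance(predicted, groundtruth))
```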

7 Conclusion 

We have presented a large-margin framework to learn metrics for unsupervised
partitioning problems, with application in particular to change-point detection in
video streams and image segmentation, with a significant improvement in
partitioning performance. For the applicative part, following recent trends in
image segmentation (see, e.g., [18]), it would be interesting to extend our
change-point framework so that it allows unsupervised co-segmentation of several
videos: each segment could then be automatically labelled so that segments from
different videos but with the same label correspond to the same action.

Figure 2: From left to right: image segmented with our learned metric, image
segmented by a parameter adjusted by exhaustive search, groundtruth segmentation,
original image in gray.

A Asymptotics of the loss between partitions

Note that in this section, we denote by $d_F^2$ the "normalized" loss between
partitions. This means that, with the notations of the article, when considering
two matrices $M$ and $N$ representing some partitions $P$ and $Q$ in the generic
set of partitions $\mathcal{P}$, we have $T d_F^2 = \|M - N\|_F^2$. Throughout this
section, we refer to the number of clusters of a partition as its size.
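
The normalized loss can be checked numerically with the sketch below. It assumes that $M$ and $N$ are the rescaled equivalence matrices of the partitions (entry $1/|P_k|$ when both indices belong to $P_k$, and 0 otherwise), which is consistent with the overlap expressions used in the proofs of this appendix; the function names are hypothetical.

```python
import numpy as np


def equivalence_matrix(labels):
    """Rescaled equivalence matrix: 1/|P_k| if i and j are both in P_k, else 0."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    sizes = same.sum(axis=1)  # size of the cluster containing each point
    return same / sizes[:, None]


def partition_loss(labels_p, labels_q):
    """Normalized loss d_F^2 = ||M - N||_F^2 / T."""
    M = equivalence_matrix(labels_p)
    N = equivalence_matrix(labels_q)
    return float(np.sum((M - N) ** 2) / len(M))


def partition_loss_from_overlaps(labels_p, labels_q):
    """Same loss, written as (|P| + |Q| - 2 sum_kl |P_k ∩ Q_l|^2 / (|P_k||Q_l|)) / T."""
    labels_p, labels_q = np.asarray(labels_p), np.asarray(labels_q)
    clusters_p, clusters_q = np.unique(labels_p), np.unique(labels_q)
    total = 0.0
    for k in clusters_p:
        in_p = labels_p == k
        for l in clusters_q:
            in_q = labels_q == l
            total += np.sum(in_p & in_q) ** 2 / (in_p.sum() * in_q.sum())
    return float((len(clusters_p) + len(clusters_q) - 2.0 * total) / len(labels_p))


p = [0, 0, 0, 1, 1, 1]
q = [0, 0, 1, 1, 1, 1]
print(partition_loss(p, q), partition_loss_from_overlaps(p, q))
```

Both expressions agree, and the loss vanishes exactly when the two partitions coincide up to a relabelling of the clusters.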

A.1 Hypothesis

• We consider two partitions $P$ and $Q$ of the same size, with a common number
of clusters $K$.

• $\forall k, l \in \{1, \ldots, K\}$, we denote by $e_{k \to l} = |P_k \cap Q_l|$
the flow which goes out from $P_k$ to $Q_l$ when $P$ goes to $Q$.

• We define the global outer flow as $e_{k \to \cdot} = \sum_{l \neq k} e_{k \to l}$
and the global inner flow as $e_{\cdot \to k} = \sum_{l \neq k} e_{l \to k}$.

A.2 Main result

Theorem 1. Let P and Q two partitions satisfying our hypothesis. If we note M (P, Q) = 



max fc ^ { mira (|p+|| Pi |) }, then 35 : V 2 -^Rsuch that sup P! Q, K xM(P,Q)<e \ 6 ( p > Q)\ ->e->-o 
and \/P, Q E V of the same size K, T, 






K 



Td%(P,Q) = 2j2i 



) x (l + d(P,Q)) 



fc=i 



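Proof. From the expressions of Section 3.1 we can write:

$$T d_F^2(P, Q) = 2 \sum_{k=1}^{K} \left( 1 - \frac{|P_k \cap Q_k|^2}{|Q_k||P_k|} \right) - 2 \sum_{k \neq l} \frac{e_{k \to l}^2}{|P_k||Q_l|}.$$
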
The second term can easily be bounded using $M$:

2 V ^ < 2AfV — 

ft |P fe |(|P,|-e fe ^+e^) " ^|PH-e fc V 

We can go further, noticing that $e_{k \to \cdot} \le K M |P_k|$, which eventually
leads to, if $M < 1/(2K)$ (and this is the case if $\delta$ tends to 0 in the sense
of the assumption of the theorem):



ft IftKlfll - «U + e^ fe ) " ft |P| - " ft P 



Now, let us bound the first term, which takes a little longer:



1 \Pkf\Qk\ 2 l _ (\Pk-e k .+ ) 



\Qk\\Pk\ \P k {\P k \-e k ^+e^ k ) 



x 



+ ~y^ k ' \Pk\(\Pk\ - e k ^ + e^ k ) 



But, for the same reasons as when we bounded the second term 

9 K o 



< 



\P k \(\P k \ - e k ^ + e^ k ) ~ ftjftl 2 ' 



Using the fact that Vfc, (K)M > ^*=*j-, we finally get that, when M < 1/2K: 



\p k \(\p k \-e k ^ + e^ k ) - fr[\Pk\' 



<4M]T efc - 



Thus, putting everything together, when $KM \to 0$, we get the statement of the
theorem. □



B Equivalence between the loss between partitions and
the Hausdorff distance for change-point detection

As mentioned in the title of this section, there is a deep link between the
Hausdorff distance and the distance between partitions that we used throughout
this paper in the case of change-point detection applications. We propose here to
show that the two distances are equivalent.

B.1 Hypothesis and notations

• We consider the segmentations $P$ and $Q$ as having been embedded in $[0, 1]$,
so that we can consider a distance $d$ on $[0, 1]$ to define the Hausdorff
distance between the frontiers of the elements of $P$ and $Q$.

• We denote by $l_m(P)$ the minimal length of a segment in a partition
$P \in \mathcal{P}$ and by $l_{ma}(P)$ the maximal one.

• We denote by $d_H$ the Hausdorff distance between partitions as described in
Section 3.
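
The quantities introduced above can be computed as in the following sketch, which assumes that each segmentation of a series of length $T$ is given by a non-empty list of change-point indices and rescales them to $[0, 1]$; the function names are hypothetical.

```python
import numpy as np


def hausdorff_boundaries(cps_p, cps_q, T):
    """Hausdorff distance between two non-empty sets of change-points in [0, 1]."""
    a = np.asarray(cps_p, dtype=float) / T
    b = np.asarray(cps_q, dtype=float) / T
    gaps = np.abs(a[:, None] - b[None, :])
    # Largest distance from a boundary of one set to the closest one of the other.
    return float(max(gaps.min(axis=1).max(), gaps.min(axis=0).max()))


def min_max_segment_length(cps, T):
    """Minimal and maximal segment lengths after embedding in [0, 1]."""
    edges = np.concatenate([[0.0], np.sort(np.asarray(cps, dtype=float)), [float(T)]]) / T
    lengths = np.diff(edges)
    return float(lengths.min()), float(lengths.max())


eps = hausdorff_boundaries([20, 40], [22, 45], T=100)
l_min, l_max = min_max_segment_length([20, 40], T=100)
print(eps, l_min, l_max)
```

In this example the Hausdorff distance is below half of the minimal segment length of the first segmentation, which is the regime considered in the main result below.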



B.2 Main result 

Theorem 2. Let $P, Q$ denote two partitions. If $|P| = |Q|$ and
$d_H(P, Q) = \epsilon < \frac{1}{2} l_m(P)$, then we have the following:

< d 2 F {P,Q) < 12K- 



lma (^) 

Moreover, without assuming \P\ — \Q\, we get 

^ Q) ~ max(l ma (P),l m (Q))) " T 

Proof. First, let us prove the upper bound. Using the expressions of Section 3.1,
we have to lower-bound $\sum_{k,l=1}^{K} \frac{|P_k \cap Q_l|^2}{|P_k||Q_l|}$. Note
that the hypothesis that the Hausdorff distance is less than half of the minimal
length only serves to ensure that the $l$-th segment of partition $Q$ can only
overlap with the $(l-1)$-th, $l$-th and $(l+1)$-th elements of $P$. Thus:

$$\sum_{k,l} \frac{|P_k \cap Q_l|^2}{|P_k||Q_l|}
= \sum_{k} \frac{|P_k \cap Q_k|^2}{|P_k||Q_k|}
+ \sum_{k} \frac{|P_k \cap Q_{k+1}|^2}{|P_k||Q_{k+1}|}
+ \sum_{k} \frac{|P_k \cap Q_{k-1}|^2}{|P_k||Q_{k-1}|}
\ge \sum_{k} \frac{(|P_k| - 2\epsilon)^2}{|P_k|(|P_k| + 2\epsilon)}
\ge K - 6 \sum_{k=1}^{K} \frac{\epsilon}{|P_k|}
\ge K - 6 \frac{\epsilon K}{l_m(P)},$$

which gives us the upper bound. Note that we used the fact that, for all
$x \in [0, 1]$, the inequality $\frac{(1 - x)^2}{1 + x} \ge 1 - 3x$ holds.




For the lower bound, note that it always holds, but we only give the proof in the
case where the Hausdorff distance is such that $d_H(P, Q) < l_m(P)/2$ and where
$|P| = |Q|$.

First, let us begin with some general statements:

i) By definition, $\epsilon = \max\{\max_{P_i \in \partial P} \min_{Q_j \in \partial Q} d(P_i, Q_j), \max_{Q_i \in \partial Q} \min_{P_j \in \partial P} d(Q_i, P_j)\}$.

ii) If the first term in the max is attained, that means there exists some such 
that \Pi* — Qj* | = e. It also means that, if we look at the sequences, there is no 
elements of dQ is between Pi- and Qj*. Thus, by definition of the loss d 2 F (P 1 Q) > 
2X) Q eP 3 nQ »_! pePj\Q p^ 2 , and a short computation leads to cZ|,(P, Q) > 2 j-^j (1— 

iii) If the second term in the max is attained, the same lower bound holds by
permuting indices.

Let's go back to our special case, we have \P*\ > 2e and \Q* \ > 2e. This leads to 

4(p ' Q) - max( dn'do) ) 

□ 